Research Title

“Utilizing Dynamic Harmonic Regression Models for Forecasting Complaints: A First-Case Study from the Call Center of Rome Capital, 2021-2022”

Research Type

[MoD | method-over-data]


Abstract

Life in Rome is characterized by chaos and frenetic activity. The city spans a vast area with limited underground connections, and it faces challenges such as rural animal invasions, widespread traffic congestion, air pollution, and significant concerns regarding garbage collection, locally known as “monnezza”. This paper examines the organizational structure of efforts to address these issues, including the roles, number, and responsibilities of the figures involved. Data collected from various channels, including reports and complaints submitted to the Citizen’s Digital Home, Call Center, Help Desk, and Email, by both resident and non-resident citizens, tourists, and city users, are analyzed to provide insights influenced by the socio-political structure of Rome and the diverse objectives of the Call Center.

While initial attempts at clustering may not be straightforward, our research aims to delve deeper into the data to better understand the city’s issues and potentially support the efforts of city officials. This study seeks to answer questions regarding the organization of efforts and the expected outcomes, including the identification of key tasks and figures required.

Our primary objective is prediction. Key challenges include capturing the seasonality of certain phenomena, understanding the temporal and socio-economic dynamics, and keeping in mind that some issues may not necessarily worsen over time. We will try to forecast complaints to the municipality using Dynamic Harmonic Regression (DHR) models, as they allow for the systematic analysis of temporal patterns and trends and enable parametric forecasts that we expect to be accurate and efficient to compute.


Main research aim & framework

Municipality

The main goal is to predict where the next complaints will occur and what they will be about. We are also interested in clustering them and in trying to localize them within the municipalities of Rome.

Tasks

We would also like to see how the complaints are distributed over the year: whether they are spread more or less evenly, or whether there are specific points in time where they peak. In the latter case, we will try to understand the reasons behind these peaks.
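As a minimal sketch of how such peaks could be spotted, the snippet below counts complaints per month and flags the months whose count lies more than one standard deviation above the average. The dates are made up for illustration, and the function names and the one-standard-deviation threshold are our own assumptions, not part of the dataset or of any library.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def monthly_counts(dates):
    """Count complaints per (year, month), sorted chronologically."""
    counts = Counter((d.year, d.month) for d in dates)
    return dict(sorted(counts.items()))


def peak_months(counts, z=1.0):
    """Return the months whose count exceeds mean + z * standard deviation."""
    values = list(counts.values())
    threshold = mean(values) + z * pstdev(values)
    return [month for month, n in counts.items() if n > threshold]


# Hypothetical data: 10 complaints in every month of 2021,
# plus 20 extra complaints in August (assumption for illustration only).
example_dates = [date(2021, m, d) for m in range(1, 13) for d in range(1, 11)]
example_dates += [date(2021, 8, d) for d in range(11, 31)]

counts = monthly_counts(example_dates)
peaks = peak_months(counts)
print(peaks)
```

With the real data, the same counts could be computed per municipality or per complaint category, and the threshold would need tuning.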

Secondary goals

We need to consider the limited duration of the accessibility of the data and its discontinuity across certain years.


The Genesis of Our Concept & Relevance

The idea originated from personal experience. In Rome, everyone has had the experience of calling 060606 to complain, or of thinking about doing it; but after the first attempt, with expectations badly disappointed, any further attempt is abandoned. We want to take a first step towards properly understanding and addressing these problems, and towards understanding why they are so hard to solve. Up to now, if you see a fallen tree, you think: someone else will take care of it. In August, garbage overflows: it is normal, as it is known that employees are on vacation, and so on. Also, for the bike lanes introduced under the Raggi administration almost 5 years ago, how can one not expect complaints from cyclists, because it is still normal to find cars parked on them, or from drivers, because it is not fair to narrow the roads and increase traffic? Drivers are, after all, the greatest experts on how to manage and improve city traffic: by allowing everyone to comfortably use their car. Of course, don’t you think?! Much more could be said, but we limit ourselves to this, and it should be clear how the idea for this project was born: we are tireless complainers :) .


Feasibility

Here are some documents we are keeping for a future, deeper understanding of the topic: Kalman filter, DHR documentation link.


Papers (up to now!)

This article was interesting in the first place because it uses a differential tree model to shape multiple data sets and aims to detect distributional differences between them. In particular, it clusters two types of fires in the Australian countryside, which matches the clustering task we eventually have in mind, at least with regard to the desired output.

This article talks about a forecasting problem and solves it using DHR (Dynamic Harmonic Regression) and SMC (Sequential Monte Carlo) methods. We liked this article because the problem is similar to ours and the methodology is explained step-by-step.

This article suggests a new approach to the DHR method. It makes better forecasts possible by using predictor variables inside the DHR, in order to also capture the effects of contextual factors such as holidays and reminder e-mails.


Data source(s)

Data is collected and retrieved from the Open Data section of Roma Capitale, whose website can be accessed here. In particular, the dataset we have selected contains all cases, including calls and complaints, which also encompass those related to AMA, the public corporation responsible for coordinating and managing waste collection in Rome. These cases are collected through various channels such as CzRM (the Digital Home of the Citizen), Call Centers, Physical Complaints, and Email, and they involve both citizens and non-citizens of Rome, as well as city-users and tourists. The dataset is open source and can be found here.

Potential data limitations

Regarding the data collection process, as mentioned earlier, we will utilize previously collected data, thereby accessing it directly. However, given that this data originates from a new experimental open-source database of Roma Capitale, it will require preprocessing and cleaning. We anticipate encountering difficulties in rectifying the dataset, particularly concerning time dependencies, and will consider multiple approaches to address this, along with assessing their performances. Additionally, some feature engineering will be necessary.

In terms of size, the two datasets, “case open” and “case closed,” are in .csv format, at approximately 800 KB and 11,300 KB and 7,050 and 13,000 rows per year, respectively. For now, they can easily be extended to cover two years. Further extension, however, is limited by how up to date the published data is (missingness), and we will probably avoid it, on the hypothesis that it would not improve the results.

Furthermore, given that the two datasets offer different information, matching the information temporally may pose challenges. Our initial approach will primarily focus on the closed dataset, tentatively exploring the open dataset, and eventually describing the differences we expect to encounter.


Model & Methods

We plan to use the DHR (Dynamic Harmonic Regression) method for this project. When we searched the internet and the literature related to our prediction problem, the articles titled “Hierarchical Forecasting of Web Server Workload Using Sequential Monte Carlo Training” and “Beyond the beaten paths of forecasting call center arrivals: on the use of dynamic harmonic regression with predictor variables” caught our attention the most. The DHR method is useful for making predictions through time series analysis while also capturing seasonality. The main idea in DHR is to represent the time series with sinusoidal functions, so that patterns are easier to recognize, and then to apply a dynamic modelling technique. Since we expect to observe strong seasonality in our dataset, we believe the DHR method will help us capture those effects. For the dynamic modelling part, we plan to use the Sequential Monte Carlo method, as the authors did in the first of the above-mentioned articles, because it is flexible, especially when working with non-linear and non-Gaussian data such as ours. Additionally, we might include some predictor variables (most likely holidays) to test the method described in the second article, comparing the results of the classic DHR with those of DHR with predictor variables.
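To make the "sinusoidal representation plus predictor variables" idea concrete, here is a minimal sketch under strong simplifying assumptions. It fits a static harmonic regression (fixed coefficients, ordinary least squares) with weekly Fourier terms and a holiday dummy on synthetic data. It is not the full DHR of the cited articles: the dynamic part (time-varying coefficients, Sequential Monte Carlo training) is deliberately left out. The function names, the weekly period, the number of harmonics, and the simulated series are all illustrative assumptions.

```python
import numpy as np


def fourier_terms(t, period, K):
    """Return a (len(t), 2*K) matrix of sin/cos pairs for the given period."""
    t = np.asarray(t, dtype=float)
    cols = []
    for k in range(1, K + 1):
        angle = 2.0 * np.pi * k * t / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)


def design_matrix(t, holiday, period=7, K=2):
    """Columns: intercept, linear trend, Fourier terms, holiday dummy."""
    t = np.asarray(t, dtype=float)
    holiday = np.asarray(holiday, dtype=float)
    return np.column_stack(
        [np.ones_like(t), t, fourier_terms(t, period, K), holiday]
    )


def fit_harmonic_regression(y, t, holiday, period=7, K=2):
    """Least-squares fit; the coefficients are constant over time (not dynamic)."""
    X = design_matrix(t, holiday, period, K)
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return beta


def predict(beta, t, holiday, period=7, K=2):
    """Evaluate the fitted regression at new time points."""
    return design_matrix(t, holiday, period, K) @ beta


# Synthetic daily complaint counts (assumption, not real data):
# level 50, slow upward trend, weekly cycle, and a drop of 15 on "holidays".
rng = np.random.default_rng(0)
t_train = np.arange(140)
holiday_train = (t_train % 30 == 0).astype(float)
y_train = (
    50.0
    + 0.05 * t_train
    + 10.0 * np.sin(2.0 * np.pi * t_train / 7.0)
    - 15.0 * holiday_train
    + rng.normal(0.0, 1.0, t_train.size)
)

beta = fit_harmonic_regression(y_train, t_train, holiday_train)

# Forecast the next two weeks, supplying the known future holiday indicator.
t_future = np.arange(140, 154)
holiday_future = (t_future % 30 == 0).astype(float)
forecast = predict(beta, t_future, holiday_future)
```

Comparing a fit with and without the holiday column on held-out data would be one simple way to approximate the "classic DHR versus DHR with predictor variables" comparison before moving to the dynamic version.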

Innovation in the topic

The main inherent difficulties of the problem revolve around the use of a dataset for which we have not found similar structures or research. Furthermore, knowledge of Rome and its socio-economic characteristics would be advisable. From this perspective, we rely on our “Romanity” (p.s: Although not by blood, Simay feels Roman at heart.), the sources available on the Roma Capitale website, and various statistics, particularly concerning the switchboard and the distribution of poverty by municipalities. This uncertainty and the presence of multiple factors are among the difficulties we seek to address.

Regarding the temporal horizon, should we succeed in configuring a predictive model, we are particularly interested in Dynamic Harmonic Regression (DHR) models. These models offer a dynamic framework for capturing temporal patterns and seasonality, which aligns well with the nature of our data and the challenges posed by the problem. By incorporating DHR models into our predictive analyses, we aim to enhance the accuracy and robustness of our forecasts, thus providing valuable insights for decision-makers and stakeholders. This strategic choice reflects our commitment to adopting advanced methodologies that can effectively address the complexities inherent in urban socio-economic systems, ultimately contributing to more informed policy-making and resource allocation efforts.

IML analysis

We also plan to use IML analysis. We plan to use Cluster Analysis and Temporal and Geospatial Analysis in order to get a better picture of the data and to build the best prediction model for new complaints. For Cluster Analysis, we plan to use tree-based clustering to identify similar patterns or groups within our data. Such a clustering analysis can uncover hidden structures and associations, guiding targeted interventions and personalized strategies. Furthermore, with geospatial and temporal analysis, we would like to explore the spatial and temporal distribution of the complaints. We hope to uncover trends, hotspots, and patterns over time and space, which would also help with resource allocation.


Software/Hardware Toolkit

We plan to use both the R and Python languages. For pre-processing, exploratory analysis and clustering we plan to use R and its packages such as, but not limited to: PCA, dhReg, caret, rpart, randomForest. On the other hand, we plan to use Python to predict where the next complaints will be, with packages such as: Scikit-learn (DecisionTreeClassifier, DecisionTreeRegressor, GradientBoostingClassifier, GradientBoostingRegressor), XGBoost or LightGBM, CatBoost.

We are leaning towards privileging R for the model’s implementation. Some of the articles also come with code, so we do not exclude providing some self-implemented functions if necessary.


References

Portal Open Data

You can find on Moodle the list of main articles in .bib file.


Project Timeline

Concerning the timeline, we are almost sure that it will change over the course of the project, but for now this is the plan:

Here is a sketch of our timeline: